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Abstract 

Design considerations for molecular dynamics algorithms capable of taking advantage of the 
computational power of a graphics processing unit (GPU) are described. Accommodating the 
constraints of scalable streaming-multiprocessor hardware necessitates a reformulation of the 
underlying algorithm. Performance measurements demonstrate the considerable benefit and 
cost-effectiveness of such an approach, which produces a factor of 2.5 speed improvement over 
previous work for the case of the soft-sphere potential. 
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1. Introduction 

The ability of computers to maintain an exponential performance growth has been 
made possible by shrinking component size permitting higher levels of integration, faster 
instruction execution and a wealth of hardware capabilities including cached memory 
access, multiple instruction units, pipelined processing, and sophisticated instruction 
scheduling, to name but a few. Features leading to higher effective computation speeds 
that were once confined to costly high-performance hardware have gradually trickled 
down to the affordable CPU chips in current use. Reduced power needs also allow multi- 
ple processor cores to reside on a single chip, a recent notable example being the graphics 
processing unit, or GPU (conventional CPUs now also adopt this strategy). The latest 
GPUs are fully programmable, and some are even capable of processing hundreds of sep- 
arate data streams in parallel. Of the many different kinds of scientific and engineering 
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computations, those with a more regular data organization, matrix-vector operations for 
example, can utilize GPU hardware very effectively, while the inherent lack of systemati- 
cally arranged data in, for example, molecular dynamics - MD - simulation, complicates 
the task of effective GPU usage. 

The availability of optimized computational algorithms is essential for carrying out MD 
simulations of large systems over long time intervals. Past efforts invested in developing 
hardware-customized algorithms have tended to focus on high-end supercomputers, with 
architectures based on vector or parallel processing, or even both together; the result- 
ing algorithms can be quite efficient, but introduce additional complexity to overcome 
hardware constraints. What is special about the GPU is that it offers high computa- 
tional capability while avoiding the cost penalty of other forms of supercomputing since 
it is a byproduct of consumer product development (as, indeed, are the microprocessors 
powering modern computers in general). Effective GPU utilization also calls for spe- 
cialized algorithms, but widespread availability makes it an attractive platform for MD 
applications. 

The present paper explores the requirements for developing a GPU version of an ef- 
ficient, scalable MD simulation for simple fluid systems. Scalability is an essential char- 
acteristic of any algorithm designed for the massive parallelism intrinsic to present and 
future GPU designs, and, as will be described in detail, the approach described here is 
not subject to the limitations of earlier efforts that addressed this problem. After a brief 
outline of the GPU as it appears from a software perspective, the way the MD algorithms 
need to be modified to utilize the hardware features is described, including a short di- 
gression on programming issues specific to the kind of parallelism on which GPU design 
is based that, due to their novelty, are still relatively unfamiliar. Measurements of actual 
performance and its dependence on various features of the algorithm are examined, as is 
the payoff - actual and potential - from the effort invested in the algorithm development. 

2. GPU hardware a brief overview 

Graphics have become an integral part of computing, and the demand for increased 
capability has resulted in a gradual shift in graphics processor design from hardwired 
functionality, via software controlled vertex and pixel shaders, to the fully programmable 
GPU [1]. The reason a GPU can outperform a CPU, sometimes by orders of magnitude, is 
that it is designed to support structured floating-point intensive computation - the kind 
that lies at the heart of the graphics rendering process - rather than being optimized to 
support the high flexibility demanded from a 'conventional' CPU. When provided with a 
suitable software interface, the GPU can also be used as a high-performance coprocessor 
for non-graphics tasks and, indeed, is a likely building block for the next generation of 
supercomputers . 

The CUDA™ (compute unified device architecture) approach [2] is a recent devel- 
opment aimed at simplifying the task of constructing software to utilize complex GPU 
hardware without excessive immersion in the details, while retaining the ability to scale 
the computation as more powerful (in particular, increasingly parallel) hardware becomes 
available. Conceptually, CUDA operates at a high level of parallelism, and while in prac- 
tice concurrency is hardware limited, it exceeds that of a modern multiple-core CPU by 
a considerable factor. Parallelism is expressed through independently executed threads 
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(not to be confused with Unix threads) that are grouped into blocks; for MD compu- 
tations, since thread management costs little in terms of performance, a thread can be 
assigned to evaluate some quantity associated with just a single atom, so that there will 
be as many threads as there are atoms (without regard for the actual parallelism of the 
hardware). This represents the ultimate in fine-grained parallelism. 

The ideal program consists of a series of calls by the host CPU to execute blocks of 
threads in parallel on the GPU, together with other nonparallel tasks (hopefully not time 
consuming) needed to support this effort. Blocks are processed independently of one an- 
other, in parallel to the extent permitted by the hardware, and then sequentially (strictly 
speaking, threads are processed in smaller batches known as warps, a detail mostly in- 
visible to the software). Although threads do not communicate among themselves, the 
fact they can access high-speed shared memory and be mutually synchronized provides 
a usable hardware abstraction. If there are data-dependent conditional branches then 
groups of parallel threads are executed in series, the groups following alternative paths 
with only threads on the path enabled. 

Threads all have access to common global memory in the GPU that is separate from 
the host memory, and while there is considerable latency involved, a high bandwidth can 
result if access is correctly organized in a manner that allows memory requests by different 
threads to be coalesced (there are also other memory spaces, some with faster access, 
of more limited visibility). For those classes of problem with well-structured data, e.g., 
matrix computations, GPU performance tends to be limited only by the computation 
rate, while for others, such as MD simulation, it is the memory access rate that limits 
performance; the situation is improved somewhat both by the ability of large numbers 
of threads to help conceal memory latency and by the availability of memory caching. In 
recognition of the potential usefulness of the GPU as a numerical processor, hardware 
improvements are being aimed at eliminating the usability constraints of earlier designs, 
examples being the need for increased memory speed and flexibility, the lack of error- 
correcting memory, and support for fast double-precision arithmetic. 

3. MD algorithms 
3.1. Background 

A typical MD computation entails evaluating forces on atoms (or molecules) , integrat- 
ing the equations of motion, and measuring various properties [3]. Of these tasks, by far 
the most intensive is the force evaluation. In the case of large systems with short-range 
forces, where each atom interacts only with a very small fraction of the entire system, the 
key to efficiency is the identification of potential interaction partners with a minimum 
of effort. This can be accomplished by dividing the simulation region into cells of size 
exceeding the interaction cutoff range r c , assigning atoms to cells based on their current 
coordinates, and then only examining pairs of atoms in the same or adjacent cells. This 
reduces the computational effort for a system with N a atoms from 0(N%) to 0(N a ). It is 
worth noting that systems with long-range forces can be transformed into an essentially 
short-range problem; not doing so is extremely inefficient for large N a , but does provide a 
good starting point for learning GPU technique [4] since this naive 0(N%) method is able 
to use the same efficient, block-organized technique employed in matrix multiplication 
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[2]- 

A further performance improvement, this time by a multiplicative factor only, is ob- 
tained by using the cell-organized data to construct a list of neighbors that includes 
atom pairs with separation r < r n = r c + 5, where S is the thickness of a surrounding 
shell (after enlarging the cells accordingly); since this list can be guaranteed to include 
all pairs with r < r c over several integration time steps, the work associated with list 
construction is amortized over those steps, while the fraction of pairs encountered during 
the force evaluations with r > r c is reduced substantially. The neighbor list is updated 
when the cumulative maximum atom displacement reaches 5/2. 

The MD algorithm in this case, after setting up the initial state, involves a loop con- 
taining the following operations [3]: (a) the first part of the leapfrog integration (a half 
time step update of velocities and a full time step update of coordinates); (b) if neighbor 
list updating is required then correct the coordinates for periodic boundary crossings 
and do the rebuild; (c) compute forces and potential energy; (d) the second part of the 
leapfrog integration (a half time step update of velocities); (e) evaluate properties such as 
kinetic energy and maximum velocity (used to decide when the next neighbor list update 
is due); (f) during equilibration adjust the velocities. 

This approach is ideal for the conventional CPU; while the neighbor list itself can 
be large, depending on r n and the mean density, only minimal storage is required to 
support cell assignment owing to the use of linked lists (see below). For 'unconventional' 
processors, such as those requiring vector operations to achieve high performance, or those 
based on fine-grained parallelism such as a GPU, the hardware is incompatible with the 
efficient use of linked lists, and even simple tabulation of data about neighboring pairs 
needs to be rethought. 

The problem to be solved is as much one of data organization as it is of computa- 
tion; it is an issue that is awkward to accommodate when designing algorithms for a 
vector processor, and, to a lesser degree, for a streamed-multiprocessing GPU. The al- 
gorithm designed for the GPU will require increased storage to avoid the use of linked 
lists, a change that originated in attempts to achieve effective vectorization [3,5]; it will 
also entail significantly more computation because Newton's third law will not be used 
in order to allow more systematic memory access. Such sacrifices are justifiable when 
they contribute to the performance overall. The stages in converting the computational 
algorithm to a form that resolves the incompatibilities are described below. Alterna- 
tive GPU implementations of MD for short-range forces are described in [6,7], and the 
intermediate-range case in [8] ; these will be referred to again subsequently. 

Although Ref. [6] and the present paper share much in common, there is an essential 
difference in the way data is accessed, as described below, that ensures optimal scaling 
of the new method as hardware parallelism is increased in future GPUs, a capability not 
present in the earlier work. Furthermore, a relatively large interaction range is needed 
for the benchmarks reported in [6] to achieve efficient hardware utilization, even with 
the more limited parallelism offered by past generations of GPUs; reducing the range 
below this value, as in the soft-sphere MD example discussed below, leads to a drastic 
performance drop, whereas the efficiency of the new approach is not directly affected by 
interaction range. 

Two interaction potentials are considered. The Lennard- Jones (LJ) potential has the 
form 
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with a cutoff r c that must be specified. The soft-sphere (SP) potential is the same, except 
that r c = 2 1 / 6 cr and a constant e is added for continuity. In reduced MD units, length 
and energy correspond to a = 1, e = 1, and atoms have unit mass. 

The total number of atoms is N a — N x N y N z , where N x is the ^-component of the size 
of the ordered atom array in the initial state. The corresponding edge of the simulation 
region is of length L x — N x / p 1 ^ 3 , where p is the density. The size of the cell array used 
for identifying neighbor pairs is N c = G x G y G Zl with G x = [L x /r n \ ; cells can be indexed 
both as vectors c and scalars c = ((c z — l)G y + c y — 1)G X + c x . 



3.2. Simple neighbor lists 



The initial approach, an efficient method for general use [3], provides a basis for sub- 
sequent comparison. Neighbor list construction begins by assigning atoms to cells, based 
on coordinates, with cell contents represented as linked lists of atom indices qj. The first 
entry in the list for cell c appears in QN a +c, with subsequent entries qj all having j < N a ; 
each qj is either the identity of the next atom in the list of the owning cell, or zero if it 
the last entry. The cell edge is w x = L x /G x and the simulation region is centered at the 
origin. 

for c — 1 to N c do qN a +c = 
for i = 1 to N a do 

r ' = n + L/2 

c x = [r' x /w x \ + 1 (etc.) 

Qi = QNa + c 
QN a +c = i 

enddo 



Enumeration of neighbor pairs employs a set of nested loops, the outermost three 
scanning all cells, the next one scanning the offsets between adjacent cells (only half are 
needed, 14, including the cell itself), then the two innermost loops that scan the member 
atoms of each of the selected pair of cells; atom pairs in the same cell are only treated 
once. If the pair separation r, after allowing for periodic boundaries, satisfies the distance 
criterion, r < r n , the identities of the atom pair are saved in t' m and t'^, with a check 
(not shown) that the size limit of the pair list is not exceeded. Information about pairs 
separated by periodic boundaries is packed into j3 (the 27 possibilities are encoded in 
the 6 high-order bits) and stored along with the atom identities to avoid the need to 
repeat the tests each time the neighbor list is read (this technique is an example of how 
computation can be reduced, but it is optional, and only usable if it does not restrict the 
size of N a ); bp is the actual periodic correction to the coordinates (with components 0, 
±L X , etc.). N p is the list length. 

m = 

for c' z = 1 to G z , c'y = 1 to G y , c' x = 1 to G x do 
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for k = 1 to 14 do 

set (3 and bp for periodic boundaries (if any) 

c" = c'+ cell offset (adjust for periodic boundaries) 

i' = <lN a +c' 

do while i' > 

do while i" > 

if d ^ c" or i' < i" then 

f = ry - r v > + bp 
if r 2 < r 2 n then 
m = m + 1 

endif 
endif 

i" = ft» 
enddo 

i' = ^ 
enddo 
enddo 
enddo 

N p = m 

The evaluation of the forces /j and total interaction energy f7 follows; u(r) is the poten- 
tial energy Eq. (1), and f(r)r the derived force. The computation starts by initializing 
all fi = and U = 0, and then treats each of the _/V p neighbor pairs. The adjustments 
required to account for periodic boundaries are encoded in /?; B is a mask used to extract 
the value of (3 (B is the complement). 

for m = 1 to iVp do 

i' = t' m , i" = CkB, p^t'^kB 

f '= r t , - Tin + bp 

if r 2 < r 2 c then 

fi'=fi'+f(r)r, fi» = h» - f{r)r, U = U + u(r) 
endif 
enddo 



This simple computation should be contrasted with the corresponding GPU version 
developed subsequently. Here, although the fi are read and written multiple times in no 
systematic order, the fact that there is just a single execution thread as well as several 
levels of data cache for mediating transfers between CPU and memory should minimize 
any performance degradation. 
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3.3. Alternative neighbor list organization 



The alternative to tabulating neighbor pairs without any specific ordering is to group 
the entries according to one member of the pair. Data redundancy is eliminated, but 
there is a minor disadvantage that will be indicated below. The neighbors of atom i 
are stored sequentially in t m , with pointing to the first entry; setting the final PN a +i 
ensures the entries for the last atom are properly terminated. After atoms are assigned 
to cells, as before, for each atom i' there are loops over the neighboring cells c" of the 
cell d in which it resides, and then over the atoms i" belonging to c". 

m = 

for i' = 1 to N a do 

r ' = r v + L/2 

c 'x = \f' x /w x \ + 1 (etc.) 

Pi' = m + 1 

for k = 1 to 14 do 

(3, bp and c" (as above) 

1 - QN a +c" 

do while i" > 
if i' < i" then 

r = r v - rin + bp 
if r 2 < r 2 n then 
m = m -\- 1 

t m = i"\P 

endif 
endif 

i" = qv> 
enddo 
enddo 
enddo 

PN a +i = m + 1 

The evaluation of fi is modified to use a double loop over atoms i and over the set of 
t m itemizing i's neighbors. The disadvantage is that if the average number of neighbors 
is small, the overhead of repeatedly initializing the inner loop may be noticeable [3] . 

for i' = 1 to N a do 

for m = pi> to pe+i — 1 do 
i" = t m kB, p = t m &B 
compute (as above) 
enddo 
enddo 
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Each atom pair (i 1 , i") is considered once during the force evaluation, and both fi> and 
fi" are updated. While the atoms indexed by i' are accessed sequentially, the i" atoms ap- 
pear in no particular order. On the GPU, random read followed by write memory accesses 
incur a substantial performance penalty. The alternative is to avoid relying on Newton's 
third law by computing /j> and /j« separately; improved GPU memory performance more 
than compensates for the extra computations. The corresponding modifications to the 
algorithm are minimal: the loop in the neighbor list construction (above) over 14 of the 
neighbor cells is changed to all 27, and the test for i' < i" is replaced by i' ^ i"; only fa 
is updated, and the sum over u yields 2U. The length of the neighbor list will of course 
be doubled. 

3.4. Layer-based neighbor matrix 

A variation of the last approach, based on organizing both the cell contents and the 
neighbor pairs as matrices, leads to an algorithm that goes a long way towards satisfying 
GPU limitations. Cell assignment is as before, with a now used to record the cell con- 
taining atom i. Instead of a linked list, cell contents are organized as a series of layers, 
as used for vector processing [3], where layer I includes the Zth members of all cells (for 
those cells with > I occupants). Layer assignment is trivial, as shown below; k is the 
layer containing atom i, and k c serves as a cell occupancy counter. 

for c = 1 to N c do k c = 
for i = 1 to N a do 

k Ci — k Ci -\- 1 

enddo 

The standard CPU implementation of this layer assignment algorithm would be just 
as shown. When using the layer organization for vector processing, the fact that the Cj 
are not unique (cells typically contain multiple atoms) requires the loop over i to be 
replaced by a more complicated set of operations consistent with vectorization. For the 
GPU version, the increments of k Ci must be carried out as 'atomic' operations to avoid 
conflict. Newer GPUs support certain operations of this kind, but they are relatively 
slow. After testing, it was decided that this evaluation should be carried out on the 
host (the remainder of the work is performed on the GPU); the Cj array (a total of N a 
integers) is copied from the GPU to the host, the values of U computed, and the U array 
(of similar size) copied back to the GPU. The maximum of k Ci determines the number of 
layers in use, iV;; this too is most readily evaluated on the host (on the GPU, a reduction 
operation - discussed later - would be needed). 

Once the Cj and k are available, the cell-layer occupancy matrix H c j can be filled; 
each atom i contributes a nonzero element H Ci j i = i. The row and column indices, c 
and I, specify the cell (1 < c < N c ) and layer (1 < I < iVj); note that while iV c is fixed, 
Ni varies, and the matrix must be able to accommodate the maximum number of layers 
possible. 

The final stage is enumerating the neighbor pairs. The results are recorded in the 
neighbor matrix W m) j, where each column i corresponds to an atom, and row m specifies 
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the mth neighbor of each atom (the order is arbitrary). This layout, assuming the matrix 
to be stored in row order, allows the identities of all mth neighbors to occupy successive 
memory locations (permitting coalesced access by threads processing individual atoms); 
transposing the matrix W m ^ reduces performance by w 9%, showing the sensitivity to 
memory access issues. 

Recording the neighbor pairs involves populating W m ^. As before, the algorithm con- 
siders atoms i' sequentially, and for each there are loops, first over the neighboring cells c" 
of the cell d containing i', and then over layers / to access the neighbor atoms i" — H c " ,i- 
A count of i's neighbors appears in m,. 

for i' = 1 to N a do 

TO = 

c' = cell containing i' (as above) 
for k = 1 to 27 do 

(3, bp and c" (as above) 

for I = 1 to Ni do 

i" — H c ni 

if i" = then break 
if %' ^ i" then 

r = r v - r v , + bp 
if r 2 < r 2 n then 

TO = TO + 1 

W m<v =i"\0 
endif 
endif 
enddo 
enddo 

TOj/ = TO 

enddo 

An alternative way of organizing this task, described in [6] for neighbor pairs and in 
[8] for forces evaluated directly via the cells, amounts to having outer loops over cells and 
their contents, rather than over the atoms themselves (similar to the original neighbor-list 
algorithm, above, but using H c j instead of qi). Such an approach is designed to utilize 
GPU shared memory capability (a subject discussed later), but the disadvantage is that 
the number of threads needed per block is determined by maximum cell occupancy (equal 
to the layer count N{), a number that can be very small; this in turn limits the scalability 
of the computation since it is unable to benefit from large-scale thread parallelism. The 
present approach, in which thread parallelism is limited only by the GPU hardware and 
not by r n (which, in turn, determines cell size and occupancy), is simpler, fully scalable 
and, as shown below, exhibits relative performance varying from similar to several times 
faster. 

Force evaluation is based on W m s- Since evaluation of global sums on the GPU is 
nontrivial (see below), the interaction energy of each atom, Uj, is recorded separately, to 
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be combined at a later stage. Note that quantities that are updated multiple times in the 
innermost loop, namely /,/ and iiy, would be held in temporary (register) storage rather 
than being written to memory at each iteration. 

for i' = 1 to N a do 

fi> = 0, u v = 
for m = 1 to rriii do 

i" = W m<v kB, $ = W m , v hB 
r = r v - r v * + bp 
if r 2 < r 2 then 

fv = fv + f(r) f, u v = u v + u{r) 
endif 
enddo 
enddo 



3.5. Additional details 

Most other elements of the computation remain unchanged; only the overall program- 
ming style needs CUDA adaptation, as discussed below. Generation of the initial state 
- consisting of atoms on a cubic lattice of edge size iV e (N x = N e ), whose velocities Vi 
have magnitude \/3T (T is the temperature) and random direction, adjusted so that the 
system center of mass is at rest - is carried out on the host, and the data then transferred 
to the GPU. 

For performance reasons associated with memory caching when examining neighbors, 
demonstrated later, the sequence in which atoms are stored in memory is periodically 
reordered to ensure that the occupants of each cell are kept together. Reordering involves 
scanning the most recent version of the occupancy matrix H c j in cell order, and it is 
carried out at regular intervals just before rebuilding the neighbor matrix (but after 
periodic boundary adjustments). A more complicated approach is described in [6], where 
the scan follows a fractal-like space-filling curve. Only the coordinate and velocity arrays 
must be reordered, but not the forces which are due to recalculated. Since reordering is 
a comparatively infrequent (and undemanding) operation, the work is carried out on the 
host for simplicity. Array elements k n record the reordered atom indices and the array 
ti (of size N a ) provides temporary storage needed while reordering. 

n = 

for c = 1 to N c do 
for I = 1 to N t do 

i = H c .i 

if i = then break 

n = n + 1 
k n = i 
enddo 
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enddo 

for i = 1 to N a do U = fi 
for i = 1 to N a do r, = ifc 4 

The last remaining task in the basic MD computation is the evaluation of system 
properties requiring sums, or other operations, over all atoms; examples are the total 
kinetic and potential energies, based on summing vf and Ui, and the maximum of vf for 
determining whether the cumulative displacement is large enough to require updating 
the neighbor data. Reduction operations of this kind, trivial on a serial CPU, are more 
complex on the GPU. The technique - demonstrated below - employs a series of partial 
reductions carried out in parallel on the GPU, followed by a final reduction on the host. 

It is worth reiterating that layer assignment is the only task performed on the host on 
a regular basis (its frequency, roughly once every 10-15 time steps, is measured below); 
in addition to the minimal amount of computation entailed by this task, it requires only 
relatively small amounts of data to be transferred to and from the host (an array of 
N a integers in each direction). Apart from this, and with the exception of the one-time 
initialization and the infrequent atom reordering, the entire computation is executed by 
the GPU, and there is no need for any other large, time-consuming data transfers to or 
from the GPU. 

The approach is readily extended to other kinds of MD systems more complicated than 
the spherical atoms considered here. For example, only minor modifications are necessary 
to allow the GPU to handle multiple atomic species with different interactions, polymers 
bound together by internal forces, the velocity-dependent forces used in modeling gran- 
ular systems, and even rigid bodies with multiple interaction sites. There are, however, 
other extensions of the general MD approach, such as the incorporation of geometrical 
constraints or long-range forces, where efficient GPU implementation requires additional 
algorithmic development outside the scope of the present work. 

4. Implementation 

4.1. Design environment 

A brief discussion of the CUDA features used in this work follows; a more compre- 
hensive treatment appears in the programming documentation [2,9] and other material 
available on the Web. Newer and more advanced hardware than used here includes other 
features able to improve performance further; anticipated future gains are likely to come 
primarily from increased parallelism, fewer restrictions on memory access, and faster 
processing. 

The freely available CUDA software environment simplifies development, especially 
because of the cooperation between the GPU and host compilers; thus for simple appli- 
cations, both the C (or other language) host code and the GPU functions, written in C, 
can coexist in the same file for convenience. The necessary libraries are also provided, so 
all that is required is a graphics processor supporting CUDA and its device driver; the 
present study was carried out on computers running the Linux (Fedora 11) operating 
system. 
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The parallel streaming multiprocessors of the GPU are subject to limitations similar to 
those of vector processors regarding data organization and usage. In the context of MD, 
since the basic algorithm implicitly allows concurrent evaluation of the forces between 
multiple pairs of atoms, careful organization is required to ensure that individual contri- 
butions are combined correctly. This supplied the motivation for the algorithm redesign 
described in the previous section. The problem now is to ensure efficient execution, taking 
into account GPU hardware limitations. 

The effect of high memory access latency can be reduced by various means, but it 
turns out to be the limiting factor, otherwise it would be the Gflop ratio that determines 
relative performance. Latency is partially hidden by having a sufficiently large number 
of thread blocks, so that while some are awaiting data from memory others are able to 
execute. Accessing global memory in the correct manner allows coalesced data transfers 
that provide a major increase in effective bandwidth, e.g., adjacent elements in memory 
accessed in parallel by a set of threads. Copying data to shared memory can also im- 
prove memory-related performance, as does the use of texture caching for reading global 
memory when coalesced transfers cannot be arranged; however, both these features are 
of hardware-limited capacity. 

4.2. Programming 

The CUDA MD implementation entails, for the most part, a few simple C extensions 
to support the multithreaded architecture [2,9]. The examples included here are intended 
to provide a taste of the style and conventions used in GPU programming. Except where 
indicated to the contrary, the entire MD simulation is readily adapted for GPU execution. 
The first example shows the function for assigning atoms to cells. The underlying change, 
here and in other segments of the computation that run on the GPU, is the removal of the 
explicit (outermost) loop over atoms from the function and deriving the atom identity 
from the local thread and block indices instead. 

The following code, extracted from the host program and simplified, includes functions 
that allocate memory on the GPU and copy an array from host to GPU, a kernel call with 
a wait for completion (kernel calls are asynchronous), and an error check. The function 
call to CellAtomAssignGPU is a request for the GPU to execute this CUDA function, or 
kernel, in parallel (the fraction of true parallelism is hardware dependent). The notation 
«<nBlock, nThread>>> is an extension to the C function call mechanism specifying an 
execution configuration of nBlock blocks, each containing nThread threads. The total 
thread count, nBlock * nThread, ideally equals nAtom (N a ), although a partially used 
final block is handled correctly; thus, the loop over atoms on the CPU is replaced by a 
single kernel call that processes them all. The only limit to the total thread count (and 
also to N a ) is GPU memory; on the other hand, the number of threads per block is 
limited by hardware resources and the number of active blocks depends on the memory 
requirements per block. Some experimentation may be needed to find the optimal number 
of threads per block. Each thread receives the parameters passed in the call, but otherwise 
has no access to the host. 

The meaning of most variables should be obvious, r and atomlnCell - corresponding 
to fi and Cj - are arrays in GPU memory, f loat3 and f loat4 specify 3- and 4-component 
single-precision values, and int3 is a 3-componcnt integer. The availability of four compo- 
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nents reflects the graphics origin of the device; their use improves memory performance, 
even if the extra padding is space wasted. 

cudaMalloc ((void**) &r, nAtom * sizeof (float4)); 
cudaMemcpy (r, rH, nAtom * sizeof (f loat4) , 

cudaMemcpyHostToDevice) ; 
nThread = 128; 

nBlock = (nAtom + nThread - 1) / nThread; 
CellAtomAssignGPU «<nBlock, nThread»> 

(r, atomlnCell, nAtom, region, invCellWid, cells); 
cudaThreadSynchronize () ; 

if (cudaGetLastError () != cudaSuccess) { ... } 

The kernel CellAtomAssignGPU (below) is executed by the threads of the GPU. Each 
thread processes a single atom whose unique identity, id, is determined from the following 
built-in variables: the number of threads per block, blockDim.x (the .x suffix arises 
from the optional multidimensional indexing of blocks and threads - not needed here), 
the particular block under consideration, blockldx.x, and the thread within the block, 
threadldx.x (recall that C uses 0-bascd indexing). Since nAtom need not be a multiple 

of blockDim.x, a test id < nAtom is included in all kernels. The global prefix 

identifies the function as a CUDA kernel. 

global void CellAtomAssignGPU (float4 *r, int *atomInCell, 

int nAtom, float3 region, float3 invCellWid, int3 cells) 

{ 

int3 cc; 
int id; 

id = blockldx.x * blockDim.x + threadldx.x; 
if (id < nAtom) { 

cc.x = (r[id] .x + 0.5 * region. x) * invCellWid. x; 

atomlnCell [id] = (cc.z * cells. y + cc.y) * cells. x + cc.x; 

} 

} 



4.3. Reduction 

A reduction operation, the simple evaluation of, for example, the sum of a set of 
data values, requires careful implementation on a multithreaded GPU to ensure both 
correctness and efficiency. Several levels of optimized reduction are described in [10]. 
Acceptable efficiency can be obtained by iterative reduction on the GPU, halving the 
number of elements in a block of data until only one remains (extra effort, not warranted 
here, can further double the speed). Since this computation is the most unfamiliar of 
the changes required for the CUDA implementation, the second of the software examples 
demonstrates the technique. The program fragment shown evaluates the potential energy 
sum u ii an d is readily generalized to evaluating several quantities at the same time. 
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This example also demonstrates the use of shared memory, an important feature of 
the GPU for general-purpose computation. Shared memory is not subject to the high 
latency of global memory and is available to all threads in a block for the duration of the 
block's execution. In the kernel call (below) the third argument in <<< . . . >>> specifies the 
amount of shared memory needed (excessive use of shared memory, a limited resource, can 
impact performance by reducing parallelism at the block level) . Because it is meaningless 
for multiple thread blocks to write to a common memory location in an unsynchronized 
manner, the partial result from each block is written to a separate element of the array 
uSumB, of size nBlock, in global memory; the final stage of the reduction occurs on the 
host, after copying uSumB to the corresponding host array uSumH. 

EvalPropsGPU «<nBlock, nThread, nThread * sizeof (float) »> 

(u, uSum, nAtom) ; 
cudaMemcpy (uSumH, uSumB, nBlock * sizeof (float), 

cudaMemcpyDeviceToHost) ; 
uSum = ; 

for (m = 0; m < nBlock; m ++) uSum += uSumH[m] ; 
uSum = 0.5 * uSum; 

When the following kernel is executed there is one thread per atom. The first task is to 
copy that atom's Ui into the shared memory array uSh. Subsequent processing reads from 
and writes to shared memory. Only half the threads participate in the initial iteration 
of the j loop, and the number is successively halved down to unity. The synchronization 

calls to syncthreads () are crucial to ensure data is written by all threads before 

being read subsequently by other threads. The eventual result is stored in the element 
of uSumB corresponding to the block. 

global void EvalPropsGPU (float *u, float *uSumB, int nAtom) 

{ 

extern shared float uSh[] ; 

int id, j; 

id = blockldx.x * blockDim.x + threadldx.x; 
uSh [threadldx . x] = (id < nAtom) ? u[id] : 0; 
syncthreads () ; 

for (j = blockDim.x / 2 ; j > ; j = j / 2) { 

if (j > threadldx.x) uSh [threadldx.x] += uSh [threadldx.x + j] ; 
syncthreads () ; 

} 

if (threadldx.x == 0) uSumB [blockldx.x] = uSh[0]; 

} 



4.4. Texture caching 

The origin of the GPU as a graphics processor is reflected in the ability to use a cache 
mechanism for efficiently reading stored textures. This can be put to general use by 
binding a data array in GPU global memory to a texture, as in the following (extended 
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C) host code (the need to invoke graphics capabilities in this way is rare). 

texture <float4, 1, cudaReadModeElementType> texRefR; 
cudaBindTexture (NULL, texRefR, r, nAtom * sizeof (float4)); 

Then, rather than the GPU reading coordinates via, for example, rC = r [id] , the 
function call rC = texlDf etch (texRefR, id) allows the data access to take advantage 
of the cache. The resulting performance gain when processing atom pairs (where only the 
first member of each pair is accessed sequentially) , especially if the data is reordered so 
that nearby atoms are also stored in nearby memory locations (as much as possible), will 
be demonstrated later. Texture caching is also used for reading the cell-layer occupancy 
matrix H c j when recording neighbor pairs. 

5. Performance measurements 

5.1. Test environment 

The GPU model used in the tests is the NVIDIA® Quadro FX770M. This GPU is 
designed for laptop computers, an environment subject to strict power and thermal lim- 
itations; thus there are only 32 CUDA stream processors (or four streaming multipro- 
cessors), 512 Mbytes memory and a 26 Gbyte/s memory bandwidth (with CUDA level 
1.1 support). Other new and recent GPUs have greatly enhanced performance: several 
times the number of processors and higher bandwidth, together with fewer limitations 
on efficient memory access and support for double-precision arithmetic. 

Performance tests have been carried out for two MD systems. One consists of atoms 
interacting with the LJ potential; it is studied for comparison with earlier work [6] and 
has similar parameter settings, T = 1.2, p — 0.38, r c = 3, and S = 0.8. The other 
system involves the very short-range SP potential; it is explored in greater depth, with 
parameters T = 1, p = 0.8, r c = 2 1//6 , and (in most cases) 8 = 0.6. Other details, common 
to both systems, are as follows: The integration time step is At = 0.005 (MD units), 
averages are evaluated over blocks of 1000 time steps, there is an initial equilibration 
period of 500 steps during which velocities are rescaled every 20 steps to achieve a 
mean T close to that required. Runs are normally of length 6000 steps, which is ample 
for performance measurement (except when examining the effect of reordering). Unless 
stated otherwise, 128 threads are used, the texture cache is also used, and coordinate 
reordering is applied every 100 steps. 

5.2. Size dependence 

GPU performance over a range of system sizes is summarized in Table 1; times are 
expressed in p,s per atom-step (i.e., measured wall clock time per step divided by N a ) 
and are reproducible with minimal variation (assuming the GPU is not borrowed by 
other tasks). The measurements show minimal size dependence beyond that attributable 
to variations in the mean number of atoms per cell N a /N c . Other performance-related 
observations include the following: For the SP case, with the corresponding LJ values 
shown in parentheses, there are typically N p /N a = 15 (89) entries in the neighbor list 
per atom, Ni — 8-10 (40-50) layers, and the neighbor list must be updated every N u = 
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Table 1 

Size dependence for soft sphere (SP) and Lennard-Jones (LJ) systems; the numbers of atoms (N a = N% ), 
cells (N c ) and the times per atom-step (t) in fis are shown. 
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Table 2 

Performance comparisons - GPU vs CPU. 
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0.068 


11.7 
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40 


4.480 


0.240 


18.7 



11-12(14-15) steps. For the largest systems, fractions t n /t — 0.38(0.35) and tf/t = 
0.37(0.58) of the total computation time, respectively, are used for neighbor list con- 
struction and force calculation (for SP, periodic boundary adjustment, atom reordering, 
and cell and layer assignment together account for a fraction 0.03 of the time, even less 
for LJ). The computations require GPU storage of w 220(770) bytes/atom; the largest 
systems, with N a w 8.8 x 10 5 (2.6 x 10 5 ), both need w 200 Mbytes. 

5.3. GPU vs CPU and other performance comparisons 

The most important result is the magnitude of the performance improvement relative 
to a more conventional CPU. The C version of the layer-matrix program (corresponding 
to the CUDA version) was run on a Dell Precision 470 workstation with a 3.6GHz Intel 
Xeon processor, similar to, but slightly faster (nominally 1.2x) than that used in [6], 
and compiled with maximum optimization. Timings appear in Table 2, together with the 
speedup factor. The gain is impressive, and in the case of LJ, consistent with [6] which 
used a GPU (NVIDIA GeForce 8800 GTX) with 4x the number of processors (128 vs 32) 
and over 3x memory bandwidth (86 vs 26 Gbyte/s, reflecting a correspondingly wider 
memory interface); the approach in [6] is itself several times faster than [7] (which used 
cells but not neighbors) on the same model GPU. 

Another important comparison addresses the performance of the old (Ref. [6]) and new 
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Table 3 

Performance comparisons - new vs old ([6]) algorithms. 
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64 


0.068 


0.168 
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methods as r c is varied, particularly since the earlier work considered an LJ system with 
only a single, relatively large value of r c . The algorithm used to tabulate neighbors in 
[6] is readily incorporated into the present program (after correcting a few minor errors) 
since similar matrix organization is used to represent cell and neighbor data. The tim- 
ing measurements appear in Table 3. For the largest r c the two methods exhibit similar 
performance, as indicated above, but as r c is lowered (using values from [11,12]) the 
performance figures begin to favor the new approach, culminating in an almost 2.5x gain 
in the SP case. The improvement is due solely to the modified approach to neighbor tab- 
ulation (which, unlike the old method, does not require a large r c to benefit from thread 
parallelism). For the SP system, the time fractions used for neighbor list construction are 
t n /t = 0.34 and 0.74 for the new and old methods, respectively; the latter value reveals 
how this portion of the computation dominates in the old method. The present new ap- 
proach is expected to show further improvements in performance when used with more 
advanced (current and future) GPUs incorporating greatly increased parallel capability. 

There is a performance loss associated with the alterations that were made to the MD 
algorithm. A scries of measurements for each of the versions of the neighbor list and force 
computations conducted on the iV e = 48 SP system, subsequently referred to as S, were 
run on an Intel T9600 CPU (2x the speed of the 3.6GHz Xeon) belonging to the laptop 
with the GPU (HP EliteBook 8530w). Relative to the base algorithm that used a list 
of neighbor pairs, grouping the neighbors by atom (with the extra inner loop) increases 
the time by 1.16x, relinquishing Newton's third law a further 1.42x, and the use of the 
layer matrix another 1.17x; the cumulative increase is 1.92x. Thus, in order to reap the 
benefits of the GPU hardware it is necessary to work with an algorithm roughly half 
as efficient as the original, but the loss is more than justified by the net performance 
gain. A similar situation arose when adapting MD for vector processing; whether some 
of the other changes made to aid vector efficiency might be helpful here remains to be 
investigated. 

The more processing the GPU can apply to data retrieved from memory (the 'arith- 
metic intensity') the greater the expectation for improved performance. With short-range 
MD, and the SP problem in particular, atoms have few interaction partners, a fact that 
offers a difficult challenge for the GPU when compared to a CPU. Given the positive 
outcome of the comparison, despite the unfavorable nature of the problem, the future of 
the GPU-based approach appears promising. 
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Table 4 

Test of energy conservation showing mean total and kinetic energy per atom, (etot) and (efcj n ), averaged 
over the 1000 steps preceding N s tep- 



Nstep 


(etot) 




2000 


2.3260603 


1.5000234 


100000 


2.3265371 


1.5005983 



Table 5 

Dependence on shell thickness (<5) for system S; the values shown are the number of cells (N c ), the mean 
number of neighbors (N p /N a ), the maximum number of layers (Ni), the mean number of steps between 
neighbor updates (N u ), the fractions of computation time devoted to neighbor processing (t n /t) and 
force calculation (tf/t), and the time per atom-step (t). 
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21.64 


11 


15.9 
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0.075 



5.4. Energy conservation 

Energy conservation is an essential requirement for any MD simulation in the micro- 
canonical ensemble. It is simple to test whether the fact that the GPU used here is limited 
to single-precision arithmetic affects this capability. Table 4 compares the values at the 
beginning and end of a longer run (excluding the first 1000 steps that are influenced by 
velocity rescaling during equilibration). The measured drift for system S is 1 part in 5000 
over 10 5 steps, a more than acceptable value. 

5.5. Parameter dependence 

Table 5 shows the dependence on 6, the shell thickness, for system S. As <5 increases 
fewer cells are used, and there are increases in both the layer count Ni and the interval 
between neighbor list updates N u . Computation time reflects the decreasing amount of 
work required for the neighbor lists and the corresponding increase for the forces (al- 
though the overall variation is fairly small over the 5 range considered) , with a minimum 
in the vicinity of S = 0.6 as used previously. 

The number of threads per block, Nt, is a runtime parameter that must be determined 
empirically, and will vary with GPU capability as well as with problem type and size; 
Nt should exceed and be a multiple of the number of GPU processors (here 32) . Table 6 
shows the performance of system S with different Nt; processors must be kept busy, 
while not overutilizing resources available to the threads. The choice of 128 threads used 
throughout the study appears justified. 

The importance of reordering based on the atom coordinates as a means to improving 
memory access times was mentioned earlier. Table 7 shows the effect of varying the 
nominal number of steps Nu between reorderings (the operation is carried out when the 
next neighbor list rebuild falls due) for system S. Frequent reordering clearly makes an 
important contribution to performance; the value Nr — 100 is a reasonable choice. 
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Table 6 

Dependence of time per atom-step on thread count (iVy). 
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Table 7 

Dependence of time per atom-step on reorder interval (Nr). 
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Table 8 

Increasing time per atom-step when reordering is omitted. 



N s tep * 



1000 


0.088 


2000 


0.105 


5000 


0.127 


10000 


0.151 


20000 


0.183 


40000 


0.214 


60000 


0.232 


80000 


0.242 


100000 


0.249 



Table 8 shows the consequences of failing to reorder, an eventual 4x performance drop 
relative to the optimal case. The use of the texture cache as a means of improving memory 
access has also been tested; overall computation time for system S is nearly doubled 
(actually 1.85x) without the cache, demonstrating its importance in compensating for 
the effects of high memory latency. 

6. Conclusion 

The present work has been based on a rather modest GPU designed for laptop comput- 
ers, whose performance lags behind current high-end devices and, even more so, behind 
products scheduled (at the time of writing) to appear in the near future. Nevertheless, the 
processing speed is found to be considerably higher than a typical CPU, a goal achieved 
without encountering any major algorithmic or programming obstacles. An especially 
important feature of the present MD approach for short-range interactions is that, unlike 
previous work, it is completely scalable, enabling it to benefit fully from the inherent 
parallelism of the hardware. 
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It is reasonable to expect that further improvements in GPU design and performance, 
particularly the enhanced parallel-processing capability, will provide a major advance in 
affordable MD simulation at the single-GPU level for models of both simple and more 
complex systems, as well as enabling its utilization as a convenient building block for 
massively parallel supercomputers aimed at extending the realm of feasible simulation. 
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